19.3 MAP Inference and Sparse Coding
An alternative form of inference is to compute the single most likely value of the missing variables, rather than to infer the entire distribution \(p(h \mid v)\) over their possible values. AKA maximum a posteriori (MAP) inference:
\[ h^* = \arg\max_h p(h \mid v). \]
It is helpful to think of MAP inference as a procedure that provides a value of q for the evidence lower bound \(\mathcal{L}(v, \theta, q)\), i.e. the ELBO. In this sense we can think of MAP inference as approximate inference, because it does not provide the optimal q.
Review on ELBO:
\[ \mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q). \]
Exact inference consists of maximizing \(\mathcal{L}\) with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions q may be drawn from. Specifically, we require q to take on a Dirac distribution:
\[ q(h \mid v) = \delta(h - \mu). \]
This means that we can control q entirely via \(\mu\). Dropping terms of \(\mathcal{L}\) that do not vary with \(\mu\), we are left with the optimization problem
\[ \max_\mu \log p(h = \mu, v), \]
which is equivalent to the MAP inference problem
\[ h^* = \arg\max_h p(h \mid v). \]
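To see why, we can plug the Dirac \(q\) into the ELBO recalled above (a short worked step):
\[ \mathcal{L}(v, \theta, q) = \mathbb{E}_{h \sim q}[\log p(h, v)] + H(q) = \log p(h = \mu, v) + H\big(\delta(h - \mu)\big), \]
and since the entropy term does not vary with \(\mu\), maximizing \(\mathcal{L}\) over this restricted family amounts to maximizing \(\log p(h = \mu, v)\) over \(\mu\).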
The learning procedure is similar to the EM algorithm, alternating between the two steps below (see the sketch after this list):
- Perform MAP inference to infer \(h^*\)
- Update \(\theta\) to increase \(\log p(h^*, v)\)
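As a rough illustration of this alternating procedure, here is a minimal Python sketch for a generic latent-variable model; the gradient functions, step sizes, and iteration counts are hypothetical placeholders rather than anything prescribed by the book.

```python
import numpy as np

def map_inference(v, theta, grad_log_joint_h, n_hidden, n_steps=100, lr=0.1):
    """Approximate h* = argmax_h log p(h, v; theta) by gradient ascent on h.

    grad_log_joint_h(h, v, theta) is a hypothetical user-supplied function
    returning d log p(h, v; theta) / dh for the model at hand.
    """
    h = np.zeros(n_hidden)                      # start the latent code at zero
    for _ in range(n_steps):
        h += lr * grad_log_joint_h(h, v, theta)
    return h

def em_like_step(v, theta, grad_log_joint_h, grad_log_joint_theta,
                 n_hidden, lr=0.01):
    """One learning step: MAP-infer h*, then move theta uphill on log p(h*, v).

    theta is assumed to be a dict of parameter arrays, and
    grad_log_joint_theta(h, v, theta) a hypothetical function returning a dict
    of gradients with the same keys.
    """
    h_star = map_inference(v, theta, grad_log_joint_h, n_hidden)
    grads = grad_log_joint_theta(h_star, v, theta)
    return {name: theta[name] + lr * grads[name] for name in theta}
```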
MAP inference is commonly used in deep learning as both a feature extractor and a learning mechanism. It is primarily used for sparse coding models.
Review on Sparse Coding:
Sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with, for example,
\[ p(h_i) = \frac{\lambda}{2} e^{-\lambda |h_i|}. \]
The visible units are then generated by performing a linear transformation and adding noise:
\[ p(v \mid h) = \mathcal{N}\left(v; Wh + b, \beta^{-1} I\right). \]
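To make this generative story concrete, here is a small sketch that samples from such a model with NumPy; the dimensions and the values of \(\lambda\) and \(\beta\) are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_hidden, n_visible = 50, 20
lam, beta = 1.0, 10.0                       # Laplace rate and noise precision (illustrative)

W = rng.normal(size=(n_visible, n_hidden))  # factor loading / dictionary matrix
b = np.zeros(n_visible)

# Factorial Laplace prior on the hidden code: p(h_i) = (lam/2) exp(-lam |h_i|).
h = rng.laplace(loc=0.0, scale=1.0 / lam, size=n_hidden)

# Visible units: linear transformation of h plus Gaussian noise with precision beta.
noise = rng.normal(scale=np.sqrt(1.0 / beta), size=n_visible)
v = W @ h + b + noise
```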
- Challenge: Every pair of hidden variables \(h_i\) and \(h_j\) are both parents of \(v\). When \(v\) is observed, the graphical model therefore contains an active path connecting \(h_i\) and \(h_j\), so all the hidden units participate in one massive clique in \(p(h \mid v)\). Computing, or even representing, \(p(h \mid v)\) is therefore difficult.
- Solution: Use MAP inference and learn the parameters by maximizing the ELBO defined by the Dirac distribution around the MAP estimate of \(h\).
If we concatenate all the h vectors in the training set into a matrix \(H\), and all the v vectors into a matrix \(V\), then the sparse coding learning process consists of minimizing
\[ J(H, W) = \sum_{i,j} |H_{i,j}| + \sum_{i,j} \left( V - H W^\top \right)_{i,j}^2. \]
We can minimize \(J\) by alternating between (a minimal sketch follows this list):
- minimization with respect to \(H\), a convex problem (an L1-regularized least-squares, lasso-style problem)
- minimization with respect to \(W\), a convex problem (ordinary least squares)

Minimizing \(J\) jointly with respect to \(H\) and \(W\) is not convex; only each sub-problem is.
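Below is a minimal NumPy sketch of this alternating minimization, assuming the objective \(J\) written above: the \(H\)-step is solved approximately by proximal gradient descent (ISTA-style soft thresholding), and the \(W\)-step exactly by ordinary least squares. The sizes, iteration counts, and random data are illustrative only.

```python
import numpy as np

def soft_threshold(X, t):
    """Proximal operator of t * ||X||_1 (elementwise soft thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def objective(V, H, W):
    """J(H, W) = sum_ij |H_ij| + sum_ij (V - H W^T)_ij^2."""
    R = V - H @ W.T
    return np.abs(H).sum() + (R ** 2).sum()

def update_H(V, H, W, n_steps=50):
    """Convex step: decrease J over H with W fixed, via proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(W.T @ W, 2)    # Lipschitz constant of the smooth term
    step = 1.0 / L
    for _ in range(n_steps):
        grad = -2.0 * (V - H @ W.T) @ W     # gradient of the squared-error term w.r.t. H
        H = soft_threshold(H - step * grad, step)
    return H

def update_W(V, H):
    """Convex step: minimize J over W with H fixed (ordinary least squares)."""
    Wt, *_ = np.linalg.lstsq(H, V, rcond=None)   # solves H @ W^T ≈ V
    return Wt.T

# Illustrative run on random data (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 20))              # rows of V are training examples
H = rng.normal(size=(100, 50))              # rows of H are the corresponding codes
W = rng.normal(size=(20, 50))
for _ in range(10):
    H = update_H(V, H, W)
    W = update_W(V, H)
    print(objective(V, H, W))               # J decreases over the alternating sweeps
```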